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Abstract 

We demonstrate the implementation of a hybrid OpenMP and MPI 
parallelization of a conservative spectral method for the Boltzmann equa- 
tion originally developed by Gamba and Tharkabhushaman. We perform 
a scaling analysis to demonstrate that the problem is well suited to par- 
allelization, and find that the computational time scales linearly with the 
number of compute nodes on high performance computing resources. The 
original method has also been improved to higher order in space and time 
and is implemented on non-uniform grids in physical space. We test this 
scheme for an example problem in which a kinetic boundary layer gener- 
ates a shock wave for large space and long times. This is the first time 
that the fully nonlinear Boltzmann collision operator has been used to 
compute this problem. 

1 Introduction 

In this paper we consider the solution of the fully nonlinear space inhomogenous 
Boltzmann equation, which is an integrodifferential equation that describes the 
evolution of a dilute gas where only binary interactions between particles are 
considered. In particular, we extend the conservative spectral method developed 
by Gamba and Tharkabhushanam [21, 22] for use on high performance com- 
puting resources using OpenMP and MPI, and study the evolution of a kinetic 
boundary layer problem that was originally computed by Aoki et al [3] using the 
Bhatnagar-Gross-Krook (BGK) approximation for collisions. There are many 
difficulties associated with numerically solving the Boltzmann equation, most 
notably the dimensionality of the problem and the conservation of the collision 
invariants. For physically relevant three dimensional applications the distribu- 
tion function is seven dimensional and the velocity domain is unbounded. In 
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addition, the collision operator is nonlinear and requires evaluation of a five 
dimensional integral at each point in phase space. The collision operator also 
locally conserves mass, momentum, and energy, and any approximation must 
maintains this property to ensure that macroscopic quantities evolve correctly. 

Most numerical computation of the collision operator is based on stochastic 
Monte Carlo methods, for example the methods of Bird [5] and Nanbu [31]. 
These methods strongly reduce the dimensionality of the problem by approxi- 
mating the collision mechanism through a stochastic, particle-based description, 
which ensures that time and memory is not wasted on computing regions where 
/ is near zero. Because these methods use the microscopic collision mechanism 
directly they exactly conserve the quantities such as mass or energy. However, 
the stochastic nature of Monte Carlo methods give rise to statistical fluctua- 
tions in the numerical results. The amount of computational resources required 
grow very quickly in nonsteady problems, as well as problems where collisions 
strongly dominate the system, flows with high mean velocities, and problems 
with a distribution function that is far from equilibrium. In recent years deter- 
ministic methods have been able to catch up with Monte Carlo solvers, obtaining 
similar results with relatively coarse meshes and providing a faster rate of con- 
vergence without having to worry about the fluctuations or how close they are 
to Maxwellians [13]. 

To avoid these problems, a deterministic class of so called discrete velocity 
methods (DVM) were proposed. Rather than using a random collection of 
particles to collide, these methods discretize the velocity space on a finite grid in 
such a way that the collision mechanics are satisfied exactly for pairs of velocity 
grid points. This method originated with Broadwcll [10], and was later extended 
by others [12, 11, 25, 36]. This approach requires careful choice of grid points 
in velocity space and on the unit sphere to preserve the conserved quantities. 
However, while these methods are able to satisfy the conservation properties of 
the collision operator, many have observed accuracy of only 0(N~ d / 2 ), where 
N is the total number of velocity grid points in each dimension, despite being as 
computationally expensive as naively applying a standard quadrature formula 
directly to the collisional integral without any regard for conservation [7, 33]. 
Recent work by the group of Varghese [30] uses a combined Monte-Carlo and 
discrete velocity formulation, which selects random pairs of velocity grid points 
for collision, then conservatively interpolates the result back onto the grid. This 
method is less noisy than classical DSMC, and outperforms it in regions where 
there is less activity. 

Spectral methods are a deterministic approach that compute the collision 
operator to high accuracy by exploiting its Fourier structure. These methods 
grew from the analytical works of Bobylev [6] developed for the Boltzmann 
equation for Maxwell type potential interactions and integrable angular cross 
section, where the corresponding Fourier transformed equation takes a closed 
form. Spectral methods provide many advantages over Direct Simulation Monte 
Carlo Methods (DSMC) because they are more suited to time dependent prob- 
lems, low Mach number flows, high mean velocity flows, and flows that are away 
from equilibrium. In addition, deterministic methods avoid the statistical fluc- 
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tuations that are typical of particle based methods. Spectral approximations 
for this type of models where first proposed by Pareschi and Perthame [34]. 
Later Pareschi and Russo [35] applied this work to variable hard potentials by 
periodizing the problem and its solution and implementing spectral collocation 
methods. 

Using this representation, Bobylev and Rjasanow developed methods for the 
case of Maxwell molecules [8] and hard spheres [9] to derive the convolution 
and approximate it with numerical quadrature. Ibragimov and Rjasanow ex- 
tended these ideas to the more general case of variable hard spheres [24] and 
simplified the integration domain by explicitly computing the spherical inte- 
gral in the derivation of the weight function for the convolution. In a related 
work, Pareschi and Russo [35] applied this framework to develop methods for 
the Maxwell molecules, hard spheres, and variable hard spheres cases using a 
collocation method, which uses orthogonal polynomials to reduce the convolu- 
tion integral to a convolution sum, providing spectral accuracy to the collision 
integral computation. 

Inspired by the work of Ibragimov and Rjasanow [24], Gamba and Tharkab- 
hushanam [21, 22] observed that the Fourier transformed collision operator takes 
a simple form of a weighted convolution and developed a spectral method based 
on the weak form of the Boltzmann equation that provides a general framework 
for computing both elastic and inelastic collisions. Macroscopic conservation 
is enforced by solving a numerical constrained optimization problem that finds 
the closest distribution function in L 2 to the output of the collision term that 
conserves the macroscopic quantities. These methods do not rely on periodiza- 
tion but rather on the use of the fast Fourier transform in the computational 
domain, and convergence to the solution of the continuous problem is obtained 
by the use of the extension operator in Sobolev spaces [2] . 

All of these methods are computationally expensive, so there is a natu- 
ral desire to extend them to high performance computing resources. However, 
there have been relatively few published works on parallelization of Boltzmann 
solvers. Graphics Processing Units (GPUs) have emerged as a resource for ap- 
plications other than graphics, providing hundreds to thousands of lightweight 
independent processors already geared to independently operating on large sets 
of data. Frezzotti and collaborators [16, 15, 14] implemented DSMC and BGK 
type Boltzmann models on GPUs, and in parallel Malkov and Ivanov and col- 
laborators have explored implementing classical DSMC as well as the discrete 
velocity Monte Carlo of Varghese et al. on GPUs [28, 29, 26]. Many existing 
codes cannot be ported directly to GPUs, due to their relatively simple instruc- 
tion set and complicated memory management structure. In 2012, Intel released 
a new architecture known as Many Integrated Core (MIC). This architecture 
combines many more powerful cores on a single chip than ever before, all able 
to access the same pooled memory. These cores are classical CPUs, rather than 
GPUs, which allows relatively straightforward speedup of existing codes [19]. 

This paper presents the extension of the deterministic spectral method of 
Gamba and Tharkabhushanam to CPU-based parallelization using the OpenMP 
and MPI APIs, as opposed to GPUs. We also present a second order in time 
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and space extension of this method that implements nonuniform grids. Sec- 
tions 2 and 3 are presented for the benefit of the reader; much of the spectral 
formulation can be found in [21, 22]. This method has been tested on the 
Ranger supercomputer at the Texas Advanced Computing Center (TACC) and 
timing studies are provided to explore the scaling of the method to more and 
more processors for large problems, and will serve as a stepping stone to future 
implementation on MIC architecture. We present some ID in physical space 
results for a problem proposed by Aoki et al.[3] where a sudden change in wall 
temperature results in a shock that cannot be explained by classical hydrody- 
namics. 

2 The space inhomogeneous Boltzmann equa- 
tion 

The space inhomogeneous initial-boundary value problem for the Boltzmann 
equation is given by 

^/(x, v, t) + v • Vx/(x, v, t) = ±Q(/, /), (1) 

with 

xeSlcR^, v e R d 

/(x,v,0) = / (x,v) 

/(x,v,t) = / B (x,v,i), x e 90. 

where f(v,t) is a probability density distribution in v-space and /o is assumed 
to be locally integrable with respect to v and the spatial boundary condition 
/b will be specified in below. The dimensionless parameter e > is the scaled 
Knudsen number, which is defined as the ratio between the mean free path 
between collisions and a reference macroscopic length scale. 

The collision operator Q(f, f)(x,v,t) is a bilinear integral form local in t 
and x and is given by 

Q(/,/)(-,v,-)= / / B(\v-v*\,cose)(f(v:)f(v')-f(v*)f(v))dadv*, 

(2) 

where the velocities v', v'„ are determined through a collision rule (3), depending 
on v, v*„ and the positive term of the integral in (2) evaluates / in the pre- 
collisional velocities that will take the direction v after an interaction. The 
collision kernel B(\v — v*|,cos#) is a given non- negative function depending 
on the size of the relative velocity u := v — v* and cos 8 = 3 r£, where a in 
the n—1 dimensional sphere S" 1-1 is referred as the scattering direction of the 
post-collisional elastic relative velocity. 
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For the following we will use the velocity elastic (or reversible) interaction 
law in center of mass-relative velocity coordinates 

v' = v+i(|u|a-u), <=v*-^(|u|<7-u) (3) 

£(|u|,cos6>) = |u| A 6(cos0). 

We assume that the differential cross section function &(cos 9) is integrable with 
respect to a on referred to as the Grad cut-off assumption, and that it is 

renormalized such that 



b(cos6)da = 1. (4) 

g d-i 

The parameter A regulates the collision frequency as a function of the relative 
speed |u|. This corresponds to the intcrparticle potentials used in the derivation 
of the collisional kernel and are referred to as variable hard potentials (VHP) for 
< A < 1, hard spheres (HS) for A = 1, Maxwell molecules (MM) for A = 0, and 
variable soft potentials (VSP) for — 3 < A < 0. The A = — 3 case corresponds to 
a Coulombic interaction potential between particles. If &(cos 9) is independent 
of a we call the interactions isotropic (like the case of hard spheres in three 
dimensions) . 

Depending on the nature of the collisions, the collision operator can have a 
number of collision invariants. In the case of the classical Boltzmann equation 
with elastic collisions, according to the Boltzmann Theorem the only collisional 
invariants are polynomials of the form A + B ■ v + C|v| 2 . These give rise to the 
classical macroscopic conserved quantities 

P(^t) = / /(*,v,t)dv (density) 

J V 

p(x, t)V(x, t) = / v/(x, v, t)dv (momentum) 

J v 

p(x, t)e(x, t) = — ( |v| 2 /(x, v, t)dv (kinetic energy density) (5) 

^ Jv 

2.1 Boundary conditions 

On the spatial boundary dfl we use a diffusive Maxwell boundary condition 
which is given by, for x <G dfl, 



/(x ' V ' t)= (2^T; M 2) CX H'^^J' (Y-V„).n>0 (6) 
ct ™ = -(-JtH / (v- V w )-n/(x,v,i)dv, 

\tLl w J J(v-V„)-n<0 

where V„, and T w are the wall velocity and temperature, respectively, and n is 
the unit normal vector to the boundary, directed into il. The term a w accounts 
for the amount of particles leaving the domain and ensures mass conservation 
in Q,. 
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2.2 Spectral formulation 

The key step our formulation of the spectral numerical method is the use of 
the weak form of the Boltzmann collision operator. For a suitably smooth test 
function </>(v) the weak form of the collision integral is given by 

/ 0(/,/)^(v)dv= / /(v)/(v,)i?(|u|,cos0)(0(v')-0(v))dadv^v 

JR d JR d xR d xS d - 1 

(7) 

If one chooses 

0(v) = e~^/{y/^) d , 
then (7) is the Fourier transform of the collision integral with Fourier variable 



-j 

Jm 



ft \t( \ B ( U ' C0S ^)/ -»C-V' "iC'VXj 7 , 

M d xR d xS d - 1 (V2TT) a 



J 



G(u,CW(v)/(v-u)](C)du, (8) 

where [•] = denotes the Fourier transform and 

G(u,() = \u\ x J 6(cos0)(e-^ c - |u|<T e^ c - u -l)rfa (9) 

Further simplification can be made by writing the Fourier transform inside the 
integral as a convolution of Fourier transforms: 

where the convolution weights G(£, C) are given by 

G%Q = -jL-( G(u,0e-* u du (11) 

(V27t)< 1 JRd 

These convolution weights can be precomputed once to high accuracy and stored 
for future use. For many collision types the complexity of the integrals in the 
weight functions can be reduced dramatically through analytical techniques. In 
this paper we will only consider isotropic scattering in dimension 3 (6(cos 6) = 
1/4tt). In this case we have that 

G(u, C) = — f e -4(\"\-° e l ^- u - 1 da 

47T J S 2 

= |„|* fe^«sinc - l) (12) 
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To calculate Q, we have 



(13) 



This integral will be cut off at a point r — r , which will be determined 
below. Given this cutoff point, we can explicitly compute G for integer values 
of A. 

rip r\ 4tt ( Q 2 (P r o sin(pr ) + cos(pr ) - 1) - p 2 (qr Q sin(qr ) + cos(<?r ) - 1) 

(2 - r 2 |C| 2 ) cos(ro|e|) + 2r |£| sin(r |^|) - 2 x 
rvt /-i 4vr ^ gsin(pro) -psin( g r ) sin(|g|r ) - |g|r cos(|g|r ) ^ 

For other values of A, this is simply a one-dimensional integral that can be 
precomputed to high accuracy using numerical quadrature. The entirety of the 
collisional model being used is encoded in the weights, which gives the algorithm 
a large degree of flexibility in implementing different models. 



3 The Conservative Numerical Method 

3.1 Temporal and velocity space discretization 

We use an operator splitting method to separate the mechanisms of collisions 
and advection. The system is split into the subproblcms 

^/ + «-V x / = (14) 

|/ = W,/), (15) 

which are solved separately. 

Each system is evolved in time using a second-order Runge-Kutta method, 
and the systems are combined using Strang splitting. 

In order to compute the Boltzmann equation we must work on a bounded 
velocity space, rather than all of R d . However typical distributions are sup- 
ported on the entire domain, for example the Maxwellian equilibrium distribu- 
tion. Even if one begins with a compactly supported initial distribution, each 
evaluation of the collision operator spreads the support by a factor of y/2, thus 
we must use a working definition of an effective support of size R for the distri- 
bution function. Bobylev and Rjasanow [9] suggested using the temperature of 
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the distribution function, which typically decreases as exp(— \v\ 2 /2T) for large 
\v\, and used a rough estimate of R w 2\[2T to determine the cutoff. We assume 
that the distribution function is negligible outside of a ball 

Bfl.(V(x)) = {veR d :|v- V(x)| < R x }, (16) 

where V(x) is the local flow velocity which depends in the spatial variable x. 
For ease of notation in the following we will work with a ball centered at and 
choose a length R large enough that Br x (V(x)) C -Br(O) for all x. 

With this assumed support for the distribution /, the integrals in (10) will 
only be nonzero for u G B 2 r(0). Therefore, we set L — 2R and define the cube 

C L = {v G R d : \vj\ < L, j = 1, . . . ,d} (17) 

to be the domain of computation. For such domain, the computation of the 
weight function integral (13) is cut off at ro = L. 

Let N G N be the number of points in velocity space in each dimension. 
Then the uniform velocity mesh size is Av — ^ and due to the formulation 
of the discrete Fourier transform the corresponding Fourier space mesh size is 
given by AC = \- 

The mesh points are defined by 

v k = Av(k - N/2) 

C k = AC(k - N/2) (18) 
k= (fa,..., fc d ) G Z d , < kj < N- 1, j = l,...,d (19) 

3.2 Collision step discretization 

Returning to the spectral formulation (10), the weighted convolution integral 
then becomes an integral over — < £j < J ' — 1, ■ • • , d. 

Similar to what was noted in [24] , we can find the cutoff for the integration 
variable u through the term 

.9(u, v) = /(v)/(v - u) 

that appears in the Fourier transform term in (8). As supp/ = Bl(0), we have 
that 

sappg(u,-) = B L (0)nB L (u) 

To simplify notation we will use one index to denote multidimensional sums 
with respect to an index vector m 

N-1 N-1 

E= E • 

m— mi r ..,md-0 

To compute Q((k), we first compute the Fourier transform integral giving 
/(0c) y i & the FFT. The weighted convolution integral is approximated using the 



8 



trapezoidal rule 



N-l 

Q(Ck) = £ Ck)/Km)/(Ck - £m)w m , (20) 

m=0 

where w m is the quadrature weight and we set /(Ck — £m) = if (Ck — £m) is 
outside of the domain of integration. We then use the inverse FFT on Q to 
calculate the integral returning the result to velocity space. 

Note that in this formulation the distribution function is not pcriodized, as 
is done in the collocation approach of Pareschi and Russo [35] . This is reflected 
in the omission of Fourier terms outside of the Fourier domain. All integrals 
are computed directly only using the FFT as a tool for faster computation. The 
convolution integral is accurate to at least the order of the quadrature. The 
calculations below use the trapezoid rule, but in principle Simpson's rule or 
some other uniform grid quadrature can be used. However, it is known that the 
trapezoid rule is spectrally accurate for periodic functions on periodic domains 
(which is the basis of spectral accuracy for the FFT), and the same arguments 
can apply to functions with sufficient decay at the integration boundaries [4]. 
These accuracy considerations will be investigated in future work. The overall 
cost of this step is 0{N 2d ). 

3.3 Discrete conservation enforcement 

This implementation of the collision mechanism does not conserve all of the 
quantities of the collision operator. To correct this fact, we formulate these 
conservation properties as a constrained optimization problem as proposed in 
[21, 22]. Depending on the type of collisions we can change this constraint set 
(for example, inelastic collisions do not preserve energy). We focus here just on 
the case of elastic collisions, which preserve mass, momentum, and energy. 

Let M = N d be the total number of grid points, let Q = (<5i, . . . , Qm) T be 
the result of the spectral formulation from the previous section, written in vector 
form, and let ojj be the quadrature weights over the domain in this ordering. 
Define the integration matrix 




where v l , i = 1, 2, 3 refers to the ith component of the velocity vector. Using this 
notation, the conservation method can be written as a constrained optimization 
problem. 

Find Q = (Q lt . . . , Qm) T that minimizes — ||Q — Q||I such that CQ = (21) 
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The solution is given by 

Q = Q + C(CC T ) _1 CQ 

= PnQ (22) 

Overall the collision step in semi-discrete form is given by 

df 

g- t = PnQ (23) 

The overall cost of the conservation portion of the algorithm is a 0(N d ) 
matrix- vector multiply, significantly less than the computation of the weighted 
convolution. 

3.4 Spatial and Transport discretization 

For simplicity this will be presented in ID in space, though the ideas apply to 
higher dimensions. In this case the transport equation reduces to 

d f d 

— (x,v,t)+v 1 —f(x,v,t) = 0. 

We partition the domain into cells of size Axj (not necessarily uniform) 
with cell centers Xj . Using a finite volume approach, we integrate the transport 
equation over a single cell to obtain 

*7 + i /2 -i7-i /a __ 

At + Ax] ~ °' (24) 

where t n = nAt and F^ ±1 ^ 2 is an approximation of the edge fluxes v\f of the 
cell between time f" and t" +1 . We use a second order upwind scheme defined 
by 

0+1,2 - otherwise, 1 ; 

where <jj is a cell slope term used in the reconstruction defined by the minmod 
limiter. 



aj = minmod - J+1 Jj , tl ilzl , iJ+1 UzL (26) 



£n fn fn fn fn f 

Jj±t Jj jj Jj-i ttik 

x j+1 

For reconstructions at the boundary of the physical domain, ghost cells and 
extrapolation are used to determine the reconstructed slope [27]. 

On wall boundaries the incoming flux is determined using the diffusive reflec- 
tion formula (6). For problems without meaningful boundary interactions (e.g. 
shocks), a no-flux boundary condition is applied for the incoming characteristics. 
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4 Parallelization 



4.1 Shared Memory Parallelization 

The most computationally expensive term to evaluate in one time step of the 
Boltzmann solver is the weighted convolution in Q. This requires computing the 
sum of N 3 terms: X)m=o <5(£ m , Ck)/(£m)/(Ck-£m)w m for each of the N 3 values 
of Ck on the numerical grid. Each sum draws from essentially the same data 
and recombines it in a different way, which makes it ideal for shared memory 
parallelization. 

Shared memory parallelization refers to utilizing an architecture in which 
multiple processes can simultaneously access the same address space in memory. 
This type of parallelization has grown more prominent over the past decade as 
multicore processors are becoming more and more prevalent in typical personal 
computers, and they are the building blocks of larger scale high performance 
computing systems. The primary motivation for this type of architecture is that 
memory access speed is much less than the speed of a processor, and having 
a common pool of memory mitigates the need for memory transfers between 
processors of the system. In this architecture, computational work is divided 
into a discrete number of threads that are distributed to the available processors, 
which compute each thread separately. Logistically, the most difficult hurdles to 
overcome in this computing paradigm are race conditions, where one thread may 
not recieve the correct data depending on whether another thread has modified 
it before access, load balancing, where work is not distributed equally among 
the threads, and blocking, where one thread must wait on another thread to 
complete some work before it can begin. 

We utilize the OpenMP API [32] to parallelize the weighted convolution. 
This interface is available on most computing architectures and compilers, and 
in many cases can give a significant speedup for only adding two or three simple 
lines of code. Typically the same source code can be executed on anything 
ranging from a personal laptop to a high performance computing node, which 
illustrates the great portability of this model. 

For our application, the hurdles associated with parallelization mentioned 
above are not a problem. We avoid load balancing problems by evenly subdi- 
viding the work among the available cores, and each convolution is expected to 
take the same amount of operations as any other. Each thread makes a com- 
putation based the weight G and distribution function /, and is not dependent 
on communication with other threads, so there is no need to worry about race 
conditions or thread blocking. Taking this into account the speedup would be 
expected to scale linearly with p, the number of cores that can access the mem- 
ory, however memory access between the shared memory units in the node still 
present latency issues on the total speedup, as observed in the test below. 

In Tables 1-4 we present the timing results of a single evaluation of the col- 
lision operator for N = 16 and N = 24 points in each velocity direction. These 
tests are performed on the Texas Advanced Computing Center's supercomput- 
ers Ranger and Stampede. On Ranger, a single computational node contains 
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four AMD Opctcron quad-core 64 bit processors, for a total of 16 cores with a 
frequency of 2.3 GHz each, 32 GB of shared memory, and a peak performance of 
128 GFLOPS per node. Stampede is TACC's newest system, officially deploying 
in January 2013 and uses the new Intel MIC architecture. Each computational 
node consists of two Intel Xeon E-5 Processors and a Xeon Phi Coprocessor (the 
MIC component). The E-5 processors have 8 cores each at 2.7 GHz and 32 GB 
of memory, for a total of 16 cores, and the Phi Coprocessor contains 61 cores at 
1.1 GHz and 8GB of fast-access memory for a total of ^1070 GFLOPS of perfor- 
mance by itself. The peak performance of the entire system was benchmarked 
at 3959 Teraflops in November 2012, making it the 7th fastest supercomputer 
in the world [1], and has more components yet to be added. As Stampede is 
still in its trial period the necessary libraries have not been created to run this 
code on the coprocessor, so the following results are run only on the 16 cores of 
the E-5 processors. 



cores 


time (s) 


ratio 


total speedup 


1 


0.068 






2 


0.044 


1.54 


1.54 


4 


0.027 


1.63 


2.52 


8 


0.019 


1.42 


3.58 


16 


0.018 


1.05 


3.78 


Table 1: 


Time for single evaluation of collision operator 


on one node with 


OpenMP. Ranger with N=16 






cores 


time (s) 


ratio 


total speedup 


1 


0.90567 






2 


0.4899 


1.85 


1.85 


4 


0.3021 


1.62 


3.0 


8 


0.19197 


1.57 


4.72 


16 


0.1769 


1.09 


5.12 


Table 2: 


Time for single evaluation of collision operator 


on one node with 


OpenMP. Ranger with N=24 






cores 


time (s) 


ratio 


total speedup 


1 


0.03337 






2 


0.01627 


2.05 


2.05 


4 


0.00976 


1.67 


3.41 


8 


0.00599 


1.63 


5.57 


16 


0.00408 


1.47 


8.18 



Table 3: Time for single evaluation of collision operator on one node with 
OpenMP. Stampede with N=16 
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cores 



time (s) 



ratio 



total speedup 



1 

2 
4 
8 
16 



0.34386 

0.17446 

0.103 

0.06107 

0.04611 



1.97 
1.69 
1.68 
1.32 



1.97 
3.34 
5.63 
7.46 



Table 4: Time for single evaluation of collision operator on one node with 
OpenMP. Stampede with N=24 

4.2 Distributed Memory Parallelization 

The total number of operations required to compute the all of the collisional 
terms in a single timestep is 0(MN 6 ), where M is the total number of physical 
space grid points. When dividing the computational work, in shared memory 
parallelization we are restricted to the number of processors that can physically 
access the shared memory. Distributed memory parallelization on the other 
hand consists of starting multiple processes that can communicate with each 
other, each with their own private address space. 

The main drawback of distributed memory is memory access time. Unlike 
shared memory, distributed memory is less local and thus requires much more 
time to receive information from a process that is not in its address space when 
compared to a similar shared memory access. Therefore, it is important to 
design algorithms that limit communication between processes as much as pos- 
sible. Furthermore, many of the same issues from shared memory programming 
still apply, such as load balancing and blocking. 

Recall that the collision term in the Boltzmann equation is local in space, 
thus each grid point in physical space only requires /(x, •) and the precomputcd 
weights G to evaluate the collision term. This allows for a natural decompo- 
sition of the computational domain by separating across physical grid points. 
The only communication between the computational nodes is in the transport 
term (24). Each process simply sends and receives the 0(N 3 ) values of /(x, £) 
at the edges of the decomposed domain required to calculate the spatial flux. 
The communication between processes is managed by the Message Passing In- 
terface (MPI) protocol [17] using an interleaved ghost cells technique [23]. This 
mitigates the possibility of message deadlock by ensuring that all even indexed 
processes send or receive data at the same time, while the odd indexed proces- 
sors receive data from or send data to the even processors, respectively. The 
code below fills the edge cells on nodes to either side. Extrapolation is used to 
fill ghost cells at the physical domain boundaries, which is removed from the 
code below for clarity of presentation. 



if ((rank % 2) == 0) { //EVEN NODES 

MPI_Ssend(f [nX_node+l] ,N*N*N,MPI_D0UBLE,rank+l,0,MPI_C0MM_W0RLD) ; 
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MPI_Recv(f [1] ,N*N*N,MPI_DOUBLE,rank-l,0,MPI_COMM_WOR,LD, festatus) ; 

MPI_Ssend (f [nX_node] , N*N*N , MPI_DOUBLE , r ank+ 1,0, MPI_CDMM_WDRLD) ; 

MPI_Recv(f [0] ,N*N*N,MPI_D0UBLE,rank-l,0,MPI_C0MM_W0RLD, festatus) ; 

MPI_Ssend (f [2] , N*N*N , MPI_D0UBLE , rank- 1,1, MPI_C0MM_W0RLD) ; 

MPI_Recv(f [nX_node+2] ,N*N*N,MPI_D0UBLE, rank+1 , 1 ,MPI_CDMM_WDRLD , festatus) ; 

MPI_Ssend(f [3] ,N*N*N,MPI_DOUBLE,rank-l , 1 ,MPI_C0MM_W0RLD) ; 

MPI_Recv(f [nX_node+3] ,N*N*N,MPI_D0UBLE, rank+1 , 1 ,MPI_CDMM_WDRLD , festatus) ; 

} 

else { //ODD NODES 

MPI_Recv(f [1] ,N*N*N,MPI_DOUBLE,rank-l,0,MPI_COMM_WORLD, festatus) ; 

MPI_Ssend(f [nX_node+l] , N*N*N , MPI_DOUBLE , rank+1 , , MPI_CDMM_WDRLD) ; 

MPI_Recv(f [0] ,N*N*N,MPI_DOUBLE,rank-l,0,MPI_COMM_WORLD, festatus) ; 

MPI_Ssend(f [nX_node] ,N*N*N,MPI_DOUBLE, rank+1, 0,MPI_C0MM_W0RLD) ; 

MPI_Recv(f [nX_node+2] ,N*N*N,MPI_DOUBLE, rank+1 , 1 ,MPI_C0MM_W0RLD, festatus) ; 

MPI_Ssend(f [2] ,N*N*N,MPI_DOUBLE,rank-l , 1 ,MPI_C0MM_W0RLD) ; 

MPI_Recv(f [nX_node+3] , N*N*N,MPI_DOUBLE, rank+1 , 1 ,MPI_C0MM_W0RLD, festatus) ; 

MPI_Ssend(f [3] ,N*N*N,MPI_DOUBLE,rank-l , 1 ,MPI_C0MM_W0RLD) ; 

} 



To make a rough estimate of the total speedup let n be the number of 
processes and np be the number of cores used (assume a fixed number of cores 
per node). Then the speedup can be described by, to leading order, 

^serial = CMN 6 T FLOP 

^parallel 4nN^T MEM + C M N^T FLOP / np' [ ' 

where Tmem is the average time to transfer a double precision value and Tflop 
is the time for a single floating point operation. As n becomes large with N 
and M fixed, more and more data transfer is required, however one would need 
4u 2 pTmem /Tflop ~ CMN 3 before the memory access time would begin to 
dominate the collisional computations. 

We remark that this parallelization approach can work for other collisional 
models that take the form of a weighted convolution in Fourier space, for ex- 



14 



ample collisional models with anisotropic scattering or the Landau equation 
[18,20]. 

We test the this method with the sudden change in wall temperature example 
suggested by Aoki et al. [3]. These authors computed this example using 
the BGK approximation of the collision operator and finite differences. This 
was later computed in [22] with a preliminary first order serial version of this 
conservative spectral method for the fullt nonlinear Boltzmann operator. In this 
problem the gas is assumed to be initially at equilibrium, and the temperature of 
the wall at the boundary of the domain is instantaneously changed at t = 0. This 
gives rise to a discontinuous distribution function at the wall, which propagates 
into the domain and eventually forms a shock. This problem is difficult to 
compute with a Monte Carlo method due a distribution function that is far 
from equilibrium near the wall and the fact that the the shock develops over a 
long period of time. 

In Figure 4.2 we show shock formation due to a sudden change in wall 
temperature. Unlike the computations in [3], only two grid points are used 
per mean free path in the interior of the domain. Near the wall, the grid is 
refined to eight points per mean free path to better capture the finer dynamics 
where the distribution function is discontinuous. In both examples, we set the 
scaled Knudsen number to e — 1. Note that despite the discontinuity we do not 
observe any Gibbs phenomena in the solution. We hypothesize that this is due 
to the mixing effects of the convolution weights; this will be explored in future 
work. 




Figure 1: Formation of shock from sudden heating of wall. Evolution of bulk 
velocity 

In Table 5 we show the wall time scaling results for the sudden heating 
example. We see a near perfect linear scaling as more nodes are added. Overall 
the openMP/MPI hybrid scheme gives a speedup of ~ An, where n is the total 
number of nodes. Based on the openMP results above, we would expect a 8n 
increase on the Stampede processors. For the computations above, we used 32 
nodes, or 512 cores, to obtain a speedup of ~ 128 compared to the computations 
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Figure 2: Formation of shock from sudden heating of wall. Evolution of kinetic 
temperature. 




Figure 3: Results from sudden cooling of wall. Evolution of kinetic temperature. 
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Figure 4: Sudden heating: evolution of discontinuous marginal distribution near 
the wall. The plots from bottom to top are the marginal distribution of / in 
eight equispaced cells on x = [0, 1] 
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in [22]. 



nodes 



cores 



time (s) 



1 

2 

4 

8 

16 

32 

64 

128 



16 

32 

64 

128 

256 

512 

1024 

2048 



456.313 

235.315 

120.762 

61.345 

30.943 

15.252 

7.813 

4.042 



Table 5: Computational time for a single timestep in sudden heating example 
from Figure 1. 



5 Conclusions 

We have extended the spectral method of Gamba and Tharkabhushanam to a 
second order scheme with a nonuniform grid in physical space, and investigated 
its scaling to high performance computing. The method showed nearly linear 
speedup across nodes when applied to a large computation problem solved on 
a supercomputer. However, at some point memory access will still become a 
problem, not with transfer between nodes but with memory access on a single 
node. The most expensive object to store is the six dimensional weight array 
^(C>0) an d if N becomes too large it may not fit on a single node's memory, 
significantly slowing down computation. At that point, it may be faster to 
simply compute the weights on the fly, as flops are much cheaper than memory 
accesses on the large distributed systems. For the isotropic scattering cross 
sections considered in this paper this would require computation of at most a 
one dimensional integral for each velocity grid point, and in the hard sphere 
and Maxwell molecules we have an exact formula to find the weights. Memory 
management will become especially important when implementing the code on 
MICs, which have much more processing power but relatively smaller memory. 
This will be a subject of future work. 
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